Journal: Genome Biology
Article Title: KPop: accurate and scalable comparative analysis of microbial genomes by sequence embeddings
doi: 10.1186/s13059-025-03585-8
Figure Legend Snippet: Data preprocessing workflow. When analyzing NGS datasets with KPop, one can optionally pre-process sequencing reads to eliminate biases and/or focus the method on specific parts of the genome. For instance, one might align reads to a (pan-)genome and separate them into reads that align (likely to originate from the organism being studied) and reads that do not (likely to come from contamination). Furthermore, reads that do map to the pan-genome might be separated into groups specific to different genomic features; for instance, one might align them to a set of MLST genes or AMR genes. Full k-mer spectra would then be obtained separately from each group of reads (contamination, pan-genomic, MLST genes, AMR genes) and given as input to downstream/classification methods. The choice of the group of reads from which spectra are computed determines the set of sequences seen by the method, and hence the scope of the classification.
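The per-group spectra described in the legend can be illustrated with a minimal Python sketch. It is not KPop's implementation: the function names, the group labels, and the toy reads are hypothetical, reads are assumed to be already assigned to groups by an upstream aligner, and k-mers are counted on the given strand only (no canonical or reverse-complement collapsing is assumed).

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every overlapping k-mer across a collection of reads.

    Hypothetical helper: windows containing anything other than A, C, G, T
    (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def spectra_by_group(groups, k):
    """Compute one spectrum per read group (groups: name -> list of reads)."""
    return {name: kmer_spectrum(reads, k) for name, reads in groups.items()}

# Toy reads, assumed to have been split into groups by a prior alignment step.
groups = {
    "pan_genomic": ["ACGTAC", "CGTACG"],
    "contamination": ["TTTTNA"],
}
spectra = spectra_by_group(groups, k=3)
```

Each entry of `spectra` would then be passed separately to whichever downstream method is used, so the choice of group controls which sequences that method sees.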
Article Snippet: To do so, simulated next-generation sequencing (NGS) data were generated for each genome using ART, emulating Illumina HiSeq 2500 paired-end reads of length 150 bp at an average coverage of 20-fold.
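As a rough guide to what these simulation settings imply, the sketch below estimates how many read pairs are needed to reach a target average coverage. It uses the standard relation coverage = (total sequenced bases) / (genome length). The 5 Mbp genome length is a hypothetical value chosen for illustration, and the function name is an assumption; the simulator's own read count may differ slightly.

```python
import math

def read_pairs_for_coverage(genome_length, coverage, read_length):
    """Estimate the number of read pairs needed for a target average coverage.

    Each pair contributes 2 * read_length bases, so
    pairs = coverage * genome_length / (2 * read_length), rounded up.
    """
    return math.ceil(coverage * genome_length / (2 * read_length))

# Hypothetical 5 Mbp bacterial genome, 20-fold coverage, 150 bp paired-end reads.
pairs = read_pairs_for_coverage(genome_length=5_000_000, coverage=20, read_length=150)
print(pairs)  # 333334
```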
Techniques: Sequencing